Introduction to Open Data Science - Course Project

About the Project

This is my course diary for Introduction to Open Data Science. I am happy to learn to connect Git and R, even though the content of data analysis is familiar to me. I have used R Markdown before and found it useful. In addition, I have used Git on the Tietokantojen perusteet course. However, it has been a long time, so some recalling is needed. Here is the link to my IODS GitHub repository.

Today is

## [1] "Thu Dec  3 12:01:14 2020"

Reminders

I wanted to remind myself of what I did last time with R Markdown. I found a nice dataset on exchange and forward rates. I make a table and a graph of the GBP/EUR (Sterling) exchange rate. I uploaded forward2.dat to the Git repository and read the dataset from there in order to plot the rate. I hide the R code to make the output more readable.

Table of the exchange and forward rates

|         | EXUSBP | EXUSEUR | EXEURBP | F1USBP | F1USEUR | F1EURBP | F3USBP | F3USEUR | F3EURBP |
|:--------|-------:|--------:|--------:|-------:|--------:|--------:|-------:|--------:|--------:|
| Min.    | 1.073  | 0.5827  | 0.3567  | 1.067  | 0.5873  | 0.3588  | 1.061  | 0.5941  | 0.3624  |
| 1st Qu. | 1.507  | 0.8876  | 0.4855  | 1.504  | 0.8885  | 0.4845  | 1.501  | 0.8926  | 0.4841  |
| Median  | 1.617  | 1.0738  | 0.5598  | 1.616  | 1.0774  | 0.5589  | 1.609  | 1.0855  | 0.5588  |
| Mean    | 1.665  | 1.0416  | 0.6213  | 1.663  | 1.0447  | 0.6203  | 1.658  | 1.0503  | 0.6184  |
| 3rd Qu. | 1.756  | 1.1741  | 0.7107  | 1.753  | 1.1778  | 0.7091  | 1.748  | 1.1819  | 0.7051  |
| Max.    | 2.443  | 1.4222  | 1.6002  | 2.441  | 1.4240  | 1.5954  | 2.433  | 1.4278  | 1.5863  |

Graph of the GBP/EUR exchange rate.
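The hidden plotting code is roughly the following sketch. The file path is illustrative, the header assumption is mine, and the column name EXEURBP is taken from the summary table of the rates:

```r
# Sketch: read the rates and plot the GBP/EUR series as a line
forward <- read.table("data/forward2.dat", header = TRUE)  # path is a placeholder
plot(forward$EXEURBP, type = "l",
     xlab = "Observation", ylab = "GBP/EUR",
     main = "GBP/EUR exchange rate")
```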

LaTeX

I am happy to see that R Markdown supports LaTeX syntax! Here is a maximization problem from my current research on optimal mechanism design with enforcement: \[\begin{align} \max_{r(\cdot),t(\cdot),m(\cdot)} \mathbb{E} \left[ t(\theta) - K m(\theta) \right]\label{OBJ} \tag{MAX} \end{align}\] subject to \[\begin{align} &t(\theta) = \theta r(\theta) - V(\underline{\theta}) - \int_{\underline{\theta}}^{\theta} \mathcal{I}(s|s)ds \label{TAX} \tag{TAX} \\ &\mathcal{I}(\theta|\theta) \qquad \text{is nondecreasing} \label{IC} \tag{IC} \\ &\mathcal{I}(\theta|\theta) \geq 0 \label{IR} \tag{IR} \end{align}\] for all \(\theta \in \Theta\) with \(\mathcal{I}(\theta|\theta):=r(\theta) - m(\theta) \varphi\).


Regression and Model Validation

Today is

## [1] "Thu Dec  3 12:01:15 2020"

This week we learn how to wrangle data and analyse it. We read the data from the given URL. The meta descriptions can be found here.

The dataset examines the relationship between learning approaches and students’ achievements in an introductory statistics course in Finland. The original data have 183 respondents (observations) and 97 questions (variables). For our purposes, we subset the dataset to the variables gender, age, attitude, deep, stra, surf, and points. The variables are either integers or numerics; the meta-file describes each variable in detail.

Next we omit the observations where the exam points variable is zero. After that we scale the combination variables to means; that is, we divide each combination variable by the number of questions it sums over. This leaves us with 166 observations of 7 variables.
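A minimal sketch of this wrangling step (the object and helper names here are illustrative, not from the actual wrangling script):

```r
# raw_df: the full 183 x 97 survey data (hypothetical name)
# Scale a summed combination variable to a mean over its questions;
# n_deep_questions is a hypothetical count of questions in the sum.
raw_df$Deep <- raw_df$deep_sum / n_deep_questions
# Drop respondents with zero exam points
learning2014 <- subset(raw_df, Points > 0)
```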

Descriptive Analysis

Let us next read the data we created earlier in the data wrangling part:

# Read the data
learning2014 <- read.csv("/Users/teemupekkarinen/IODS-project/data/learning2014",row.names = 1)

To summarize our dataset’s content we run

# Structure
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ Gender  : int  2 1 2 1 1 2 1 2 1 2 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ Deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ Stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ Surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
# Dimension
dim(learning2014)
## [1] 166   7

That is, we have information about the gender, age, attitude, and three different learning approaches (deep, strategic, and surface) of 166 respondents. All of the variables are numeric, and the combination variables (attitude, deep, stra, and surf) are means. For a more detailed description we run:

# Summary
summary(learning2014)
##      Gender           Age           Attitude          Deep      
##  Min.   :1.000   Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  1st Qu.:1.000   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Median :2.000   Median :22.00   Median :3.200   Median :3.667  
##  Mean   :1.663   Mean   :25.51   Mean   :3.143   Mean   :3.680  
##  3rd Qu.:2.000   3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##  Max.   :2.000   Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       Stra            Surf           Points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

We observe that there are almost twice as many female respondents as male respondents: 110 women and 56 men. The average age of the respondents is 26; the youngest person is 17 years old and the oldest 55. The age distribution is wide but skewed towards young people. To summarize, a typical respondent is a young female.

# Histograms
Gender <- learning2014$Gender
hist(Gender,breaks=2)

sum(Gender==1)
## [1] 56
Age <- learning2014$Age
hist(Age)

mean(Age)
## [1] 25.51205
var(Age)
## [1] 60.31198

The average exam score is about 23 (mean 22.72), with a maximum of 33 and a minimum of 7. The exam results are more or less normally distributed among all participants.

# Histograms
Points <- learning2014$Points
hist(Points)
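Normality can also be checked formally; a quick sketch using the Points vector defined above:

```r
# Shapiro-Wilk test: a small p-value would indicate non-normality
shapiro.test(Points)
```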

However, when we compare the points between men and women, we observe that the mean of exam points for men is slightly greater than for women. Men even outnumber women at scores of 30 and 31.

# Package
library(ggplot2)

# Histograms
PointsMale <- learning2014$Points[Gender==1]
mean(PointsMale)
## [1] 23.48214
PointsFemale <- learning2014$Points[Gender==2]
mean(PointsFemale)
## [1] 22.32727
data <- data.frame(
  type = c( rep("Male", 56), rep("Female", 110) ),
  value = c(PointsMale, PointsFemale)
)
ggplot(data, aes(x=value, fill=type)) +
    geom_histogram(binwidth=.5, alpha=.5, position="identity")

The mean of attitude, 3.14, is close to the neutral value of 3. That is, on average respondents have a slightly positive attitude towards statistics. Attitude is also nearly normally distributed. Male respondents have a more positive attitude than female respondents.

# Histograms
Attitude <- learning2014$Attitude
hist(Attitude)

AttitudeMale <- learning2014$Attitude[Gender==1]
mean(AttitudeMale)
AttitudeFemale <- learning2014$Attitude[Gender==2]
mean(AttitudeFemale)
data <- data.frame(
  type = c( rep("Male", 56), rep("Female", 110) ),
  value = c(AttitudeMale, AttitudeFemale)
)
ggplot(data, aes(x=value, fill=type)) +
    geom_histogram(binwidth=.5, alpha=.5, position="identity")

A minority prefers the surface learning approach in the statistics course, whereas a majority prefers deep learning methods. Moreover, strategic learning shows the greatest variation across learning approaches. Deep and surface learning are almost identically distributed, except that deep learning has a greater mean.

# Package
library(ggplot2)

# Histograms
Deep <- learning2014$Deep
Stra <- learning2014$Stra
Surf <- learning2014$Surf
data <- data.frame(
  type = c( rep("Deep", 166), rep("Stra", 166), rep("Surf", 166) ),
  value = c(Deep, Stra, Surf)
)
ggplot(data, aes(x=value, fill=type)) +
    geom_histogram(binwidth=.5, alpha=.5, position="identity")

There is not much difference between genders in learning approaches; the distributions for male and female respondents are almost identical.

# Histograms

# Strategic Learning
StraMale <- learning2014$Stra[Gender==1]
StraFemale <- learning2014$Stra[Gender==2]
data <- data.frame(
  type = c( rep("Male", 56), rep("Female", 110) ),
  value = c(StraMale, StraFemale)
)
ggplot(data, aes(x=value, fill=type)) +
    geom_histogram(binwidth=.5, alpha=.5, position="identity")

# Deep Learning
DeepMale <- learning2014$Deep[Gender==1]
DeepFemale <- learning2014$Deep[Gender==2]
data <- data.frame(
  type = c( rep("Male", 56), rep("Female", 110) ),
  value = c(DeepMale, DeepFemale)
)
ggplot(data, aes(x=value, fill=type)) +
    geom_histogram(binwidth=.5, alpha=.5, position="identity")

# Surface Learning
SurfMale <- learning2014$Surf[Gender==1]
SurfFemale <- learning2014$Surf[Gender==2]
data <- data.frame(
  type = c( rep("Male", 56), rep("Female", 110) ),
  value = c(SurfMale, SurfFemale)
)
ggplot(data, aes(x=value, fill=type)) +
    geom_histogram(binwidth=.5, alpha=.5, position="identity")

Regressions

Next we run regressions with exam points as our dependent variable. Let us first look at the relationships between exam points and each explanatory variable.

plot(Attitude, Points, main = "Regression",
     xlab = "Attitude", ylab = "Points")
abline(lm(Points~Attitude), col = "blue")

plot(Gender, Points, main = "Regression",
     xlab = "Gender", ylab = "Points")
abline(lm(Points~Gender), col = "blue")

plot(Age, Points, main = "Regression",
     xlab = "Age", ylab = "Points")
abline(lm(Points~Age), col = "blue")

plot(Deep, Points, main = "Regression",
     xlab = "Deep Learning", ylab = "Points")
abline(lm(Points~Deep), col = "blue")

plot(Stra, Points, main = "Regression",
     xlab = "Strategic Learning", ylab = "Points")
abline(lm(Points~Stra), col = "blue")

plot(Surf, Points, main = "Regression",
     xlab = "Surface Learning", ylab = "Points")
abline(lm(Points~Surf), col = "blue")
cov(learning2014)
##               Gender        Age    Attitude        Deep        Stra        Surf
## Gender    0.22489960 -0.4383352 -0.10184739 -0.01526713  0.05326762  0.02826457
## Age      -0.43833516 60.3119752  0.12584520  0.10792260  0.61406079 -0.58075332
## Attitude -0.10184739  0.1258452  0.53276561  0.04458987  0.03478322 -0.06776013
## Deep     -0.01526713  0.1079226  0.04458987  0.30706766  0.04127419 -0.09489017
## Stra      0.05326762  0.6140608  0.03478322  0.04127419  0.59572437 -0.06570525
## Surf      0.02826457 -0.5807533 -0.06776013 -0.09489017 -0.06570525  0.27967222
## Points   -0.25972983 -4.2662651  1.87824388 -0.03315078  0.66483662 -0.45002434
##               Points
## Gender   -0.25972983
## Age      -4.26626506
## Attitude  1.87824388
## Deep     -0.03315078
## Stra      0.66483662
## Surf     -0.45002434
## Points   34.74965316

From the single-variable regressions and the covariance matrix we observe negative covariation between points and gender, age, deep learning, and surface learning. This matches what we already observed: men did better in the course than women. Moreover, younger students seem to have obtained better exam results than older participants. Strategic learning is the only learning approach with a positive covariance with exam results. There is a large positive association between attitude and points.
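Because covariances depend on the variables’ scales, the correlation matrix is easier to compare across variables; with the same learning2014 data frame:

```r
# Pearson correlations, rounded for readability
round(cor(learning2014), 2)
```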

Next we first choose gender, attitude, and age as regressors for exam points.

reg <- lm(Points~Gender+Attitude+Age)
summary(reg)
## 
## Call:
## lm(formula = Points ~ Gender + Attitude + Age)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.4590  -3.3221   0.2186   4.0247  10.4632 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.76802    3.17695   4.019 8.93e-05 ***
## Gender       0.33054    0.91934   0.360    0.720    
## Attitude     3.60657    0.59322   6.080 8.34e-09 ***
## Age         -0.07586    0.05367  -1.414    0.159    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared:  0.2018, Adjusted R-squared:  0.187 
## F-statistic: 13.65 on 3 and 162 DF,  p-value: 5.536e-08

From here we observe that attitude is the only individually significant explanatory variable. However, by the F-test the three variables are jointly significant.

Nevertheless, following the instructions, we drop the gender variable and run the regression again.

# Regression
reg <- lm(Points~Attitude+Age)
summary(reg)
## 
## Call:
## lm(formula = Points ~ Attitude + Age)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.3354  -3.3095   0.2625   4.0005  10.4911 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.57244    2.24943   6.034 1.04e-08 ***
## Attitude     3.54392    0.56553   6.267 3.17e-09 ***
## Age         -0.07813    0.05315  -1.470    0.144    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.301 on 163 degrees of freedom
## Multiple R-squared:  0.2011, Adjusted R-squared:  0.1913 
## F-statistic: 20.52 on 2 and 163 DF,  p-value: 1.125e-08

Attitude remains the strongest predictor. Age is still insignificant, even though together with attitude it is jointly significant (F-test). It turns out, after trying all possible combinations of explanatory variables, that only when regressing on attitude alone are all regressors significant at the 5 % level. We therefore run the final regression with attitude only and get the following.

reg <- lm(Points~Attitude)
summary(reg)
## 
## Call:
## lm(formula = Points ~ Attitude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## Attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

From the summary we observe that the intercept term is positive and significant: 11.64 with a standard error of 1.83. This is the predicted test score if attitude were “zero”. Each one-unit increase in attitude is associated with (not causes) a 3.5-point increase in exam results. Hence, with the maximum attitude our model predicts a test result of \(11.64 + 5 \times 3.53 \approx 29.3\) points.
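The same prediction can be obtained with predict(), assuming the fitted reg object above:

```r
# Predicted exam points at the maximum attitude value (5);
# should be roughly 29.3, matching the hand calculation
predict(reg, newdata = data.frame(Attitude = 5))
```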

We also observe that R-squared decreases a little every time we omit an explanatory variable from the regression. R-squared measures the proportion of the variance of the dependent variable that is explained by the explanatory variables. The adjusted R-squared is a modified version of R-squared that penalizes for the number of predictors in the model.
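In formulas, with \(n\) observations and \(K\) estimated coefficients (intercept included):

\[ R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}, \qquad \bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-K}. \]

For the attitude-only model this gives \(\bar{R}^2 = 1 - (1 - 0.1906)\cdot 165/164 \approx 0.186\), matching the output above.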

Lastly we plot the regression diagnostics: Residuals vs Fitted, Normal Q-Q, and Residuals vs Leverage.

plot(reg)

The Residuals vs Fitted figure is a simple scatterplot of residuals against predicted values. It should look more or less random, which is roughly the case in our analysis. The red line is not exactly zero everywhere, which is a sign that the residuals may retain a little predictive power (omitted-variable bias).

The Normal Q-Q plot helps us assess whether the residuals are roughly normally distributed. In our case the tails stray from the diagonal, which may indicate problems with the assumption of normally distributed error terms. A heavier-tailed distribution, such as a t-distribution, might fit better.

The last plot, Residuals vs Leverage (with Cook’s distance), tells us which points have the greatest influence on the regression (leverage points). It turns out that observations 35, 56, and 71 have the strongest effect on the fit.
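The influential observations can also be extracted directly, assuming the fitted reg object:

```r
# Three largest Cook's distances, i.e. the most influential observations
cd <- cooks.distance(reg)
head(sort(cd, decreasing = TRUE), 3)
```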

Causal Inference

Causal inference requires very strong assumptions.

Before going to the assumptions, we introduce the vector and matrix notation. Define K-dimensional (column) vectors \(\boldsymbol{x}_i\) and \(\boldsymbol{\beta}\) as

\[ \underset{(K \times 1)}{\boldsymbol{x}_i} = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{iK} \end{pmatrix}, \underset{(K \times 1)}{\boldsymbol{\beta}} = \begin{pmatrix} \beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_{K} \end{pmatrix}.\] Also define \[ \underset{(n \times 1)}{\boldsymbol{y}} = \begin{pmatrix} y_{1} \\ \vdots \\ y_{n} \end{pmatrix}, \underset{(n \times 1)}{\boldsymbol{\varepsilon}} = \begin{pmatrix} \varepsilon_{1} \\ \vdots \\ \varepsilon_{n} \end{pmatrix}, \underset{(n \times K)}{\boldsymbol{X}} = \begin{pmatrix} \boldsymbol{x}'_{1} \\ \vdots \\ \boldsymbol{x}'_{n} \end{pmatrix} = \begin{pmatrix} x_{11} & \dots & x_{1K} \\ \vdots & \ddots & \vdots \\ x_{n1} & \dots & x_{nK} \end{pmatrix}. \] Scalar quantities will be given in normal font \(x\), vector quantities in bold lowercase \(\boldsymbol{x}\), and all vectors will be presumed to be column vectors. Matrix quantities will be in bold uppercase \(\boldsymbol{X}\). The transpose of a matrix is denoted by either \(\boldsymbol{X}'\) or \(\boldsymbol{X}^T\).

Using the matrix notation, the assumptions for causal inference are the following:

Assumption 1.1 (linearity): \[\underset{(n \times 1)}{\boldsymbol{y}} = \underset{\underbrace{(n \times K)(K \times 1)}_{(n \times 1)}}{\boldsymbol{X} \: \: \: \boldsymbol{\beta}} + \underset{(n \times 1)}{\boldsymbol{\varepsilon}}.\]

Assumption 1.2 (strict exogeneity): \[\mathbb{E}(\varepsilon_{i}|\boldsymbol{X}) = 0 \: \: \: (i = 1, 2, \dots, n).\]

Assumption 1.3 (no multicollinearity): The rank of the \(n \times K\) data matrix, \(\boldsymbol{X}\), is \(K\) with probability 1.

Assumption 1.4 (spherical error variance): \[(\text{homoskedasticity}) \: \: \mathbb{E}(\varepsilon_{i}^2|\boldsymbol{X}) = \sigma^2 > 0 \: \: \: (i = 1, 2, \dots, n),\] \[(\text{no correlation between observations}) \: \: \: \mathbb{E}(\varepsilon_{i}\varepsilon_{j}|\boldsymbol{X}) = 0 \: \: \: (i, j = 1, 2, \dots, n; i \neq j).\]

Even though our regression diagnostic plots are promising, I would not make any kind of causal interpretations from our model.


Logistic regression

Today is

## [1] "Thu Dec  3 12:01:18 2020"

As you can see from the date stamp, I am doing this exercise late; I missed it due to personal issues. However, I wanted to learn this part of the course too, so I decided to do the exercises whether required or not.

Alcohol Consumption Data

We start by reading the data we created in the data wrangling part

joinedData <- read.csv("/Users/teemupekkarinen/IODS-project/data/joinedData.csv",row.names = 1)

and take a glimpse of the data.

library(dplyr)
glimpse(joinedData) 
## Rows: 382
## Columns: 35
## $ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP"…
## $ sex        <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F"…
## $ age        <int> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15…
## $ address    <chr> "R", "R", "R", "R", "R", "R", "R", "R", "R", "U", "U", "U"…
## $ famsize    <chr> "GT3", "GT3", "GT3", "GT3", "GT3", "GT3", "GT3", "LE3", "L…
## $ Pstatus    <chr> "T", "T", "T", "T", "T", "T", "T", "T", "T", "A", "A", "T"…
## $ Medu       <int> 1, 1, 2, 2, 3, 3, 3, 2, 3, 3, 4, 1, 1, 1, 1, 1, 2, 2, 2, 3…
## $ Fedu       <int> 1, 1, 2, 4, 3, 4, 4, 2, 1, 3, 3, 1, 1, 1, 2, 2, 1, 2, 3, 2…
## $ Mjob       <chr> "at_home", "other", "at_home", "services", "services", "se…
## $ Fjob       <chr> "other", "other", "other", "health", "services", "health",…
## $ reason     <chr> "home", "reputation", "reputation", "course", "reputation"…
## $ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "no…
## $ internet   <chr> "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes", "ye…
## $ guardian   <chr> "mother", "mother", "mother", "mother", "other", "mother",…
## $ traveltime <int> 2, 1, 1, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 1, 1, 1, 3, 1, 2, 1…
## $ studytime  <int> 4, 2, 1, 3, 3, 3, 3, 2, 4, 4, 2, 1, 2, 2, 2, 2, 3, 4, 1, 2…
## $ failures   <int> 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2…
## $ schoolsup  <chr> "yes", "yes", "yes", "yes", "no", "yes", "no", "yes", "no"…
## $ famsup     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "y…
## $ paid       <chr> "yes", "no", "no", "no", "yes", "no", "no", "no", "no", "n…
## $ activities <chr> "yes", "no", "yes", "yes", "yes", "yes", "no", "no", "no",…
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "y…
## $ romantic   <chr> "no", "yes", "no", "no", "yes", "no", "yes", "no", "no", "…
## $ famrel     <int> 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 5, 5, 3, 3…
## $ freetime   <int> 1, 3, 3, 3, 2, 3, 2, 1, 4, 3, 3, 3, 3, 4, 3, 2, 2, 1, 5, 3…
## $ goout      <int> 2, 4, 1, 2, 1, 2, 2, 3, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 1, 2…
## $ Dalc       <int> 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1…
## $ Walc       <int> 1, 4, 1, 1, 3, 1, 2, 3, 3, 1, 1, 2, 3, 2, 1, 2, 1, 1, 1, 1…
## $ health     <int> 1, 5, 2, 5, 3, 5, 5, 4, 3, 4, 1, 4, 4, 5, 5, 1, 4, 3, 5, 3…
## $ absences   <int> 3, 2, 8, 2, 5, 2, 0, 1, 9, 10, 0, 3, 2, 0, 4, 1, 2, 6, 2, …
## $ G1         <int> 10, 10, 14, 10, 12, 12, 11, 10, 16, 10, 14, 10, 11, 10, 12…
## $ G2         <int> 12, 8, 13, 10, 12, 12, 6, 10, 16, 10, 14, 6, 11, 12, 12, 1…
## $ G3         <int> 12, 8, 12, 9, 12, 12, 6, 10, 16, 10, 15, 6, 11, 12, 12, 14…
## $ alc_use    <dbl> 1.0, 3.0, 1.0, 1.0, 2.5, 1.0, 2.0, 2.0, 2.5, 1.0, 1.0, 1.5…
## $ high_use   <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE…

We have 382 observations and 35 variables. There are individual-level background variables such as gender, age, school performance, and alcohol consumption. Moreover, we have variables related to the individuals’ parental background, such as family size, parents’ cohabitation status, and parental education.

The summary of the data is the following.

summary(joinedData)
##     school              sex                 age          address         
##  Length:382         Length:382         Min.   :15.00   Length:382        
##  Class :character   Class :character   1st Qu.:16.00   Class :character  
##  Mode  :character   Mode  :character   Median :17.00   Mode  :character  
##                                        Mean   :16.59                     
##                                        3rd Qu.:17.00                     
##                                        Max.   :22.00                     
##    famsize            Pstatus               Medu            Fedu      
##  Length:382         Length:382         Min.   :0.000   Min.   :0.000  
##  Class :character   Class :character   1st Qu.:2.000   1st Qu.:2.000  
##  Mode  :character   Mode  :character   Median :3.000   Median :3.000  
##                                        Mean   :2.806   Mean   :2.565  
##                                        3rd Qu.:4.000   3rd Qu.:4.000  
##                                        Max.   :4.000   Max.   :4.000  
##      Mjob               Fjob              reason            nursery         
##  Length:382         Length:382         Length:382         Length:382        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    internet           guardian           traveltime      studytime    
##  Length:382         Length:382         Min.   :1.000   Min.   :1.000  
##  Class :character   Class :character   1st Qu.:1.000   1st Qu.:1.000  
##  Mode  :character   Mode  :character   Median :1.000   Median :2.000  
##                                        Mean   :1.448   Mean   :2.037  
##                                        3rd Qu.:2.000   3rd Qu.:2.000  
##                                        Max.   :4.000   Max.   :4.000  
##     failures       schoolsup            famsup              paid          
##  Min.   :0.0000   Length:382         Length:382         Length:382        
##  1st Qu.:0.0000   Class :character   Class :character   Class :character  
##  Median :0.0000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.2016                                                           
##  3rd Qu.:0.0000                                                           
##  Max.   :3.0000                                                           
##   activities           higher            romantic             famrel     
##  Length:382         Length:382         Length:382         Min.   :1.000  
##  Class :character   Class :character   Class :character   1st Qu.:4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median :4.000  
##                                                           Mean   :3.937  
##                                                           3rd Qu.:5.000  
##                                                           Max.   :5.000  
##     freetime        goout            Dalc            Walc           health     
##  Min.   :1.00   Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.00   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:3.000  
##  Median :3.00   Median :3.000   Median :1.000   Median :2.000   Median :4.000  
##  Mean   :3.22   Mean   :3.113   Mean   :1.482   Mean   :2.296   Mean   :3.573  
##  3rd Qu.:4.00   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:5.000  
##  Max.   :5.00   Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##     absences          G1              G2              G3           alc_use     
##  Min.   : 0.0   Min.   : 2.00   Min.   : 4.00   Min.   : 0.00   Min.   :1.000  
##  1st Qu.: 1.0   1st Qu.:10.00   1st Qu.:10.00   1st Qu.:10.00   1st Qu.:1.000  
##  Median : 3.0   Median :12.00   Median :12.00   Median :12.00   Median :1.500  
##  Mean   : 4.5   Mean   :11.49   Mean   :11.47   Mean   :11.46   Mean   :1.889  
##  3rd Qu.: 6.0   3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:2.500  
##  Max.   :45.0   Max.   :18.00   Max.   :18.00   Max.   :18.00   Max.   :5.000  
##   high_use      
##  Mode :logical  
##  FALSE:268      
##  TRUE :114      
##                 
##                 
## 

We are going to use the following four variables in our analysis:

  • Student’s gender
  • Age
  • Father’s education
  • Absences

With these choices we want to see whether there is evidence that males drink more, controlling for age. We use father’s education as a proxy variable for socioeconomic status, which gives us insight into whether people of lower socioeconomic status consume more alcohol. Lastly, we control for school attendance through absences.

To that end, we manipulate our data as follows.

library(tidyr)
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(dplyr)
analysisAlc <- joinedData %>%
  select(sex, age, Fedu, absences, high_use)
analysisAlc$Fedu <- cut(analysisAlc$Fedu, breaks = c(-Inf, 0, 1, 2, 3, Inf), labels= c("None", "primaryLow", "primaryHigh", "secondary", "Tertiary")  )
analysisAlc$Fedu <- factor(analysisAlc$Fedu)
summary(analysisAlc)
##      sex                 age                 Fedu        absences   
##  Length:382         Min.   :15.00   None       :  2   Min.   : 0.0  
##  Class :character   1st Qu.:16.00   primaryLow : 77   1st Qu.: 1.0  
##  Mode  :character   Median :17.00   primaryHigh:105   Median : 3.0  
##                     Mean   :16.59   secondary  : 99   Mean   : 4.5  
##                     3rd Qu.:17.00   Tertiary   : 99   3rd Qu.: 6.0  
##                     Max.   :22.00                     Max.   :45.0  
##   high_use      
##  Mode :logical  
##  FALSE:268      
##  TRUE :114      
##                 
##                 
## 

Descriptive Analysis

Let us next do some descriptive analysis of our data. We begin by plotting histograms.

gather(analysisAlc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped

From here we can see that approximately one third of the sample are high users of alcohol. The individuals are split almost equally into female and male youngsters, mostly 15-18 years old, and their fathers are well educated. The distribution of absences is dispersed: there are students who are almost never absent and students who are absent very often.

Next we draw some boxplots.

box <- ggplot(analysisAlc, aes(x = high_use  , y = absences, col = Fedu ))
box + geom_boxplot() + ylab("Absences")

box2 <- ggplot(analysisAlc, aes(x = high_use  , y = age, col =  Fedu  ))
box2 + geom_boxplot() + ylab("age")

The median number of absences is slightly greater among high alcohol users. Moreover, high alcohol users seem to be older (the median age is higher). Both patterns sound plausible.

Next we make a table that shows the fraction of high alcohol users by the father’s educational background. There seems to be no clear connection between high alcohol use and the father’s education.

library(knitr)
library(dplyr)
table <- analysisAlc %>%
  group_by(Fedu) %>%
  summarise(mean_high_use=mean(high_use))
## `summarise()` ungrouping output (override with `.groups` argument)
kable(table)
| Fedu        | mean_high_use |
|:------------|--------------:|
| None        |     0.0000000 |
| primaryLow  |     0.3116883 |
| primaryHigh |     0.2857143 |
| secondary   |     0.2727273 |
| Tertiary    |     0.3333333 |

Next we analyse alcohol use by age and gender.

library(tidyr)
table <- analysisAlc %>%
  group_by(sex) %>%
  summarise(mean_high_use=mean(high_use))
## `summarise()` ungrouping output (override with `.groups` argument)
kable(table)
| sex | mean_high_use |
|:----|--------------:|
| F   |     0.2121212 |
| M   |     0.3913043 |

This indicates that boys are high users of alcohol more often than girls.

Logistic Regression

In this section we use logistic regression to highlight our descriptive findings.

We start by running the model and summarizing the results

model <- glm(high_use ~ sex + absences + age + Fedu,  data = analysisAlc, family = "binomial")
summary(model)
## 
## Call:
## glm(formula = high_use ~ sex + absences + age + Fedu, family = "binomial", 
##     data = analysisAlc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2790  -0.8479  -0.6231   1.0561   2.1848  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -18.41011  623.86680  -0.030    0.976    
## sexM              1.00410    0.24205   4.148 3.35e-05 ***
## absences          0.09335    0.02326   4.014 5.98e-05 ***
## age               0.16813    0.10351   1.624    0.104    
## FeduprimaryLow   13.90961  623.86478   0.022    0.982    
## FeduprimaryHigh  13.70235  623.86475   0.022    0.982    
## Fedusecondary    13.59776  623.86476   0.022    0.983    
## FeduTertiary     13.96070  623.86475   0.022    0.982    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 423.95  on 374  degrees of freedom
## AIC: 439.95
## 
## Number of Fisher Scoring iterations: 13
coef(model)
##     (Intercept)            sexM        absences             age  FeduprimaryLow 
##    -18.41010883      1.00410227      0.09335241      0.16813210     13.90961364 
## FeduprimaryHigh   Fedusecondary    FeduTertiary 
##     13.70235173     13.59776078     13.96070194

All the regression coefficients are positive. Hence, the probability of high alcohol consumption is greater for older male individuals with a high number of absences. However, the effects of age and father's education are not statistically significant; the huge coefficients and standard errors of the Fedu dummies (together with the "fitted probabilities numerically 0 or 1" warnings below) suggest quasi-complete separation in those categories.

Let us next calculate odds ratios for the variables.

OR <- coef(model) %>% exp
CI <- confint(model) %>% exp
## Waiting for profiling to be done...
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

cbind(OR, CI)
##                           OR        2.5 %       97.5 %
## (Intercept)     1.010628e-08           NA 1.713949e+33
## sexM            2.729456e+00 1.708562e+00 4.421144e+00
## absences        1.097849e+00 1.051069e+00 1.151547e+00
## age             1.183093e+00 9.669145e-01 1.452162e+00
## FeduprimaryLow  1.098673e+06 3.612856e-37           NA
## FeduprimaryHigh 8.930088e+05 2.829941e-37           NA
## Fedusecondary   8.043267e+05 2.570223e-37           NA
## FeduTertiary    1.156261e+06 3.630630e-37           NA

Here we again see that age and father's education are not statistically significant. Therefore, let us drop these two variables and run the model again.

model2 <- glm(high_use ~ sex + absences,  data = analysisAlc, family = "binomial")
OR <- coef(model2) %>% exp
CI <- confint(model2) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
##                   OR     2.5 %    97.5 %
## (Intercept) 0.159445 0.1012577 0.2427684
## sexM        2.658116 1.6710354 4.2863129
## absences    1.101409 1.0549317 1.1548057

Now the odds ratio for the male indicator is statistically significant. The value 2.66 means that for a male the odds of high alcohol consumption are about 2.7 times greater than for a female when absences are held constant. The odds ratio of absences is 1.1, indicating that each additional absence multiplies the odds of high alcohol use by about 1.1.
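As a sanity check, these interpretations can be reproduced by hand from the printed odds ratios (the values below are hardcoded from the output above):

```r
# Coefficients recovered from the odds ratios printed above
b <- log(c(intercept = 0.159445, sexM = 2.658116, absences = 1.101409))

# Odds of high alcohol use for a male and a female with 4 absences each
odds_male   <- exp(b["intercept"] + b["sexM"] + 4 * b["absences"])
odds_female <- exp(b["intercept"] + 4 * b["absences"])

# The ratio recovers the odds ratio of the male indicator
as.numeric(odds_male / odds_female)  # 2.658116
```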

Next we make some predictions.

library(visreg)
library(caret)
## Loading required package: lattice
library(sjlabelled)
## 
## Attaching package: 'sjlabelled'
## The following object is masked from 'package:dplyr':
## 
##     as_label
model2 <- glm(high_use ~ sex + absences, data = analysisAlc, family = "binomial")
probs <- predict(model2, type = "response")
analysisAlc<- mutate(analysisAlc, probability = probs)
analysisAlc <- mutate(analysisAlc, prediction = probability > 0.5)
table(high_use = analysisAlc$high_use, prediction = analysisAlc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   258   10
##    TRUE     88   26
table(high_use = analysisAlc$high_use, prediction = analysisAlc$prediction) %>%
    prop.table() %>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67539267 0.02617801 0.70157068
##    TRUE  0.23036649 0.06806283 0.29842932
##    Sum   0.90575916 0.09424084 1.00000000
g <- ggplot(analysisAlc, aes(x = probability, y = high_use , col = prediction))
g +  geom_point()

We observe many false-negative predictions. That is, the model often indicates that a person does not consume a lot of alcohol when in fact she or he does.

Based on the following loss functions we observe that approximately one quarter of our model's predictions are wrong. The second value, obtained by predicting a constant probability for everyone, shows that our model still outperforms such naive guessing.

loss_func <- function(class, prob) {
  # share of observations whose predicted probability falls on the wrong side of 0.5
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = analysisAlc$high_use, prob = analysisAlc$probability)
## [1] 0.2565445
loss_func(class = analysisAlc$high_use, prob = 0.333)
## [1] 0.2984293
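The loss function above is simply the misclassification rate: it counts how often the prediction falls on the wrong side of 0.5. A toy check (the function is redefined so the snippet is self-contained):

```r
loss_func <- function(class, prob) {
  # TRUE when the predicted probability is on the wrong side of 0.5
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# Two of the four predictions below are on the wrong side
loss_func(class = c(TRUE, TRUE, FALSE, FALSE), prob = c(0.9, 0.2, 0.1, 0.7))  # 0.5
```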

Clustering and Classification

Today is

## [1] "Thu Dec  3 12:01:22 2020"

This week we use the Boston data from the MASS package. The data have 14 variables and 506 observations for each variable. The variables are either numerical or integers.

# Package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data(Boston)
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Namely, the variables are crim (per capita crime rate by town), zn (proportion of residential land zoned for large lots), indus (proportion of non-retail business acres per town), chas (Charles River dummy), nox (nitrogen oxides concentration), rm (average number of rooms per dwelling), age (proportion of owner-occupied units built prior to 1940), dis (weighted mean of distances to five Boston employment centres), rad (index of accessibility to radial highways), tax (property-tax rate), ptratio (pupil-teacher ratio by town), black (a transformation of the proportion of black residents by town), lstat (percentage of lower-status population), and medv (median value of owner-occupied homes).

The correlation matrix can be illustrated using the corrplot package as follows.

library(corrplot)
## corrplot 0.84 loaded
corr_matrix<-cor(Boston)
corrplot(corr_matrix)

That is, there is a significant positive correlation between crim, rad, tax, and lstat. On the other hand, the weighted mean of distances to five Boston employment centres, dis, correlates negatively with indus, nox, and age.

However, since the crime rate seems to be the variable of interest, let us illustrate it with some graphics.

require(ggplot2)
require(plotly)
## Loading required package: plotly
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(data = Boston, x = ~crim, type = "histogram")
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
plot_ly(data = Boston, x = ~rad, y = ~crim)
## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
plot_ly(data = Boston, x = ~tax, y = ~crim)
## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
plot_ly(data = Boston, x = ~lstat, y = ~crim)
## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode

Next we standardize the dataset and print out summaries of the scaled data.

scaled.Boston <- data.frame(scale(Boston))

Now all variables have mean of \(0\) and variance \(1\). For instance, now the plot of crim and lstat is the following.

plot_ly(data = scaled.Boston, x = ~lstat, y = ~crim)
## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
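The effect of scale() is easy to verify on a toy vector: it subtracts the mean and divides by the standard deviation.

```r
x <- c(2, 4, 6, 8)
z <- scale(x)  # (x - mean(x)) / sd(x)
mean(z)        # 0
sd(z)          # 1
```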

Next we divide crim into 5 categories based on the 20%-quantiles (quintiles).

library(gtools)
q.crim <- quantcut(scaled.Boston$crim,q = 5)
summary(q.crim)
## [-0.419,-0.413] (-0.413,-0.403] (-0.403,-0.356]  (-0.356,0.229]    (0.229,9.92] 
##             102             101             101             101             101
scaled.Boston$crim <- q.crim
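quantcut() is essentially cut() applied at the sample quantiles; a base-R sketch of the same idea on simulated data:

```r
set.seed(1)
x <- rnorm(500)
# breaks at the 0%, 20%, ..., 100% quantiles
breaks <- quantile(x, probs = seq(0, 1, by = 0.2))
q <- cut(x, breaks = breaks, include.lowest = TRUE)
table(q)  # five categories of (roughly) 100 observations each
```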

We divide the dataset into train and test sets so that 80% of the data belongs to the train set.

sample <- sample.int(n = nrow(scaled.Boston), size = floor(.8*nrow(scaled.Boston)), replace = F)
train <- scaled.Boston[sample, ]
test  <- scaled.Boston[-sample, ]

The linear discriminant analysis:

# MASS and train are available

# linear discriminant analysis
lda.fit <- lda(crim~., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crim ~ ., data = train)
## 
## Prior probabilities of groups:
## [-0.419,-0.413] (-0.413,-0.403] (-0.403,-0.356]  (-0.356,0.229]    (0.229,9.92] 
##       0.1955446       0.2004950       0.1980198       0.2103960       0.1955446 
## 
## Group means:
##                         zn      indus        chas        nox          rm
## [-0.419,-0.413]  1.2289308 -0.9919693 -0.12281893 -0.9117984  0.52532136
## (-0.413,-0.403]  0.1178038 -0.5224852 -0.02929819 -0.6534455  0.09327762
## (-0.403,-0.356] -0.2816979 -0.2714370  0.12138096 -0.2874372  0.01752820
## (-0.356,0.229]  -0.4267078  0.5746297  0.23717802  0.8207357 -0.19851700
## (0.229,9.92]    -0.4872402  1.0149946 -0.07298222  1.0253163 -0.29879457
##                        age        dis         rad        tax     ptratio
## [-0.419,-0.413] -0.8925501  0.9531282 -0.71438014 -0.7735768 -0.48814134
## (-0.413,-0.403] -0.4732191  0.3753418 -0.55793090 -0.5069924 -0.26401726
## (-0.403,-0.356] -0.0190914  0.1186790 -0.46793221 -0.4469321  0.02111665
## (-0.356,0.229]   0.6472870 -0.5257096  0.09228323  0.2269943 -0.14629009
## (0.229,9.92]     0.8525123 -0.8933315  1.65960290  1.5294129  0.80577843
##                      black       lstat        medv
## [-0.419,-0.413]  0.3872598 -0.80899932  0.58453774
## (-0.413,-0.403]  0.3562116 -0.41520978  0.25859279
## (-0.403,-0.356]  0.2987226 -0.01308464  0.09306665
## (-0.356,0.229]  -0.1935722  0.14199508 -0.07891044
## (0.229,9.92]    -0.9427289  0.97919705 -0.79055113
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3         LD4
## zn      -0.012130754  0.83213259  1.10936204 -0.25605164
## indus    0.154075566 -0.31077073  0.24012061  0.14949310
## chas    -0.074203341 -0.06224915 -0.07601345 -0.07200118
## nox      0.402418638 -0.90518526  1.26391889 -0.75211814
## rm       0.075865601  0.28911936 -0.23020678 -0.58379768
## age      0.243579699 -0.30414585  0.36050063 -0.20823905
## dis     -0.066911227 -0.69397644  0.21749263 -0.54758555
## rad      1.624367299  1.05092627  0.09523746 -0.12096880
## tax     -0.014558433 -0.19393270 -0.61632646  1.00328645
## ptratio -0.004230517  0.07997125  0.16193298 -0.83765381
## black   -0.166249120 -0.01752590 -0.15002556 -0.08766222
## lstat    0.320361204  0.13178705 -0.74935128 -1.11112675
## medv     0.039301801 -0.47588448  0.03455914 -0.43736581
## 
## Proportion of trace:
##    LD1    LD2    LD3    LD4 
## 0.8540 0.0933 0.0456 0.0071
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crim)

# plot the lda results
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)

Next we cross tabulate the predicted crime categories against the true categories from the test set.

# lda.fit, correct_classes and test are available
correct_classes <- test$crim

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##                  predicted
## correct           [-0.419,-0.413] (-0.413,-0.403] (-0.403,-0.356]
##   [-0.419,-0.413]              12               8               3
##   (-0.413,-0.403]               2              13               4
##   (-0.403,-0.356]               0               6              10
##   (-0.356,0.229]                0               0               4
##   (0.229,9.92]                  0               0               0
##                  predicted
## correct           (-0.356,0.229] (0.229,9.92]
##   [-0.419,-0.413]              0            0
##   (-0.413,-0.403]              1            0
##   (-0.403,-0.356]              5            0
##   (-0.356,0.229]               6            6
##   (0.229,9.92]                 0           22
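The overall accuracy of the LDA predictions can be read off the diagonal of the cross-tabulation above (counts hardcoded from the table):

```r
# Confusion matrix: rows = correct class, columns = predicted class
conf <- matrix(c(12,  8,  3, 0,  0,
                  2, 13,  4, 1,  0,
                  0,  6, 10, 5,  0,
                  0,  0,  4, 6,  6,
                  0,  0,  0, 0, 22), nrow = 5, byrow = TRUE)
sum(diag(conf)) / sum(conf)  # share of correct predictions, about 0.62
```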

Let us next reload the Boston dataset and standardize it. Then we calculate the distances between the observations.

# load MASS and Boston
library(MASS)
data('Boston')

# standardization
Boston <- scale(Boston)

# euclidean distance matrix
dist_eu <- dist(Boston)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(Boston, method = "manhattan")

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
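The difference between the two metrics is easiest to see on a pair of 2-D points:

```r
m <- rbind(c(0, 0), c(3, 4))
dist(m)                        # Euclidean: sqrt(3^2 + 4^2) = 5
dist(m, method = "manhattan")  # Manhattan: |3| + |4| = 7
```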

One way to determine the number of clusters is to look at how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes. A good choice is the point where the total WCSS drops radically. In this case, two clusters seem optimal.

# MASS, ggplot2 and Boston dataset are available
set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
km <-kmeans(Boston, centers = 2)

# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)


Dimensionality Reduction Techniques

Today is

## [1] "Thu Dec  3 12:01:30 2020"

This week we use the ‘human’ dataset, which originates from the United Nations Development Programme. We have 155 observations and 8 variables. Below is a graphical overview of the data together with summaries of the variables.

# Read the data
human <- read.csv("/Users/teemupekkarinen/IODS-project/data/human",row.names = 1)
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  0.198 0.931 0.861 0.977 0.989 ...
##  $ Labo.FM  : num  0.199 0.685 0.211 0.633 0.747 ...
##  $ Edu.Exp  : num  9.3 11.8 14 17.9 12.3 20.2 15.7 11.9 12.6 14.4 ...
##  $ Life.Exp : num  60.4 77.8 74.8 76.3 74.7 82.4 81.4 70.8 75.4 76.6 ...
##  $ GNI      : int  1885 9943 13054 22050 8124 42261 43869 16428 21336 38599 ...
##  $ Mat.Mor  : int  400 21 89 69 29 6 4 26 37 22 ...
##  $ Ado.Birth: num  86.8 15.3 10 54.4 27.1 12.1 4.1 40 28.5 13.8 ...
##  $ Parli.F  : num  27.6 20.7 25.7 36.8 10.7 30.5 30.3 15.6 16.7 15 ...
dim(human)
## [1] 155   8
summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

Namely, the variables are Edu2.FM (the ratio of female to male populations with at least secondary education), Labo.FM (the ratio of female to male labour force participation), Edu.Exp (expected years of schooling), Life.Exp (life expectancy at birth), GNI (gross national income per capita), Mat.Mor (maternal mortality ratio), Ado.Birth (adolescent birth rate), and Parli.F (the percentage of female representatives in parliament).

A random subsample of the data, sorted by GNI:

set.seed(1)
s <- sample(1:nrow(human),20)
subdata <- human[s,]
sort.GNI <- subdata$GNI[order(subdata$GNI,decreasing = T)]
barplot(sort.GNI, names.arg = row.names(subdata[order(subdata$GNI,decreasing = T),]), las=2)

The scatter plot of Life.Exp and GNI

library(ggplot2)
library(ggrepel)
list <- c("Australia",
"Azerbaijan","Belgium","Canada", "Chile", "China", "Cuba", "Denmark", 
"Finland", "France", "Spain", "United States", "United Kingdom", "Germany", "India", "Greece", "Ghana", "Israel", "Hungary", "Niger", "Argentina", "Mexico", "Venezuela", "Mongolia", "Morocco", "Nepal", "Namibia", "Pakistan", "Peru", "Philippines", "Romania", "Tajikistan", "Tunisia", "Senegal", "Zambia")
s <- which(rownames(human) %in% list)
p <- ggplot(human[s,], aes(Life.Exp, GNI)) +
  geom_point(color = 'red') +
  theme_classic(base_size = 10) + labs(y = 'Gross National Income per capita', x = 'Life expectancy at birth')
p + geom_text_repel(aes(label = rownames(human[s,])),
                    size = 3.5) 

The scatter plot of Edu.Exp and GNI

p <- ggplot(human[s,], aes(Edu.Exp, GNI)) +
  geom_point(color = 'red') +
  theme_classic(base_size = 10) + labs(y = 'Gross National Income per capita', x = 'Expected years in schooling')
p + geom_text_repel(aes(label = rownames(human[s,])),
                    size = 3.5) 

The scatter plot of Edu2.FM and GNI

p <- ggplot(human[s,], aes(Edu2.FM, GNI)) +
  geom_point(color = 'red') +
  theme_classic(base_size = 10) + labs(y = 'Gross National Income per capita', x = 'The ratio of Female and Male populations with secondary education')
p + geom_text_repel(aes(label = rownames(human[s,])),
                    size = 3.5) 

The scatter plot of Labo.FM and GNI

p <- ggplot(human[s,], aes(Labo.FM, GNI)) +
  geom_point(color = 'red') +
  theme_classic(base_size = 10) + labs(y = 'Gross National Income per capita', x = 'The ratio of labour force participation of females and males')
p + geom_text_repel(aes(label = rownames(human[s,])),
                    size = 3.5) 

The scatter plot of Parli.F and GNI

p <- ggplot(human[s,], aes(Parli.F, GNI)) +
  geom_point(color = 'red') +
  theme_classic(base_size = 10) + labs(y = 'Gross National Income per capita', x = 'Women\'s participation in parliament (percent)')
p + geom_text_repel(aes(label = rownames(human[s,])),
                    size = 3.5) 

To summarize the correlations between the variables, let us plot the correlation matrix.

library(corrplot)
corr_matrix<-cor(human)
corrplot(corr_matrix)

Principal Component Analysis

Principal Component Analysis (PCA) can be performed by two slightly different matrix decomposition methods from linear algebra: the Eigenvalue Decomposition and the Singular Value Decomposition (SVD).

Both methods quite literally decompose a data matrix into a product of smaller matrices, which lets us extract the underlying principal components. This makes it possible to approximate a lower-dimensional representation of the data by choosing only a few principal components.
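The connection between prcomp() and the SVD can be checked directly: the singular values of the centered data matrix, divided by sqrt(n - 1), equal the component standard deviations. A quick check on the built-in iris data:

```r
X <- scale(iris[, 1:4], center = TRUE, scale = FALSE)  # centered data matrix
s <- svd(X)
pc <- prcomp(iris[, 1:4], center = TRUE, scale. = FALSE)
all.equal(s$d / sqrt(nrow(X) - 1), pc$sdev)  # TRUE
```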

Let us next perform the PCA for our human data.

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)
pca_human
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
## 
## Rotation (n x k) = (8 x 8):
##                     PC1           PC2           PC3           PC4           PC5
## Edu2.FM    5.607472e-06 -0.0006713951 -3.412027e-05 -2.736326e-04  0.0022935252
## Labo.FM   -2.331945e-07  0.0002819357  5.302884e-04 -4.692578e-03 -0.0022190154
## Edu.Exp    9.562910e-05 -0.0075529759  1.427664e-02 -3.313505e-02 -0.1431180282
## Life.Exp   2.815823e-04 -0.0283150248  1.294971e-02 -6.752684e-02 -0.9865644425
## GNI        9.999832e-01  0.0057723054 -5.156742e-04  4.932889e-05  0.0001135863
## Mat.Mor   -5.655734e-03  0.9916320120  1.260302e-01 -6.100534e-03 -0.0266373214
## Ado.Birth -1.233961e-03  0.1255502723 -9.918113e-01  5.301595e-03 -0.0188618600
## Parli.F    5.526460e-05 -0.0032317269 -7.398331e-03 -9.971232e-01  0.0716401914
##                     PC6           PC7           PC8
## Edu2.FM    2.180183e-02 -6.998623e-01  7.139410e-01
## Labo.FM    3.264423e-02 -7.132267e-01 -7.001533e-01
## Edu.Exp    9.882477e-01  3.826887e-02  7.776451e-03
## Life.Exp  -1.453515e-01 -5.380452e-03  2.281723e-03
## GNI       -2.711698e-05  8.075191e-07 -1.176762e-06
## Mat.Mor    1.695203e-03 -1.355518e-04  8.371934e-04
## Ado.Birth  1.273198e-02  8.641234e-05 -1.707885e-04
## Parli.F   -2.309896e-02  2.642548e-03  2.680113e-03
summary(pca_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped


library(ggfortify)
pca.plot <- autoplot(pca_human, data = human, colour = 'GNI')
## Warning: `select_()` is deprecated as of dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
pca.plot

And the same for standardized data:

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(scale(human))
pca_human
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
## 
## Rotation (n x k) = (8 x 8):
##                   PC1         PC2         PC3         PC4        PC5
## Edu2.FM    0.35664370 -0.03796058 -0.24223089  0.62678110 -0.5983585
## Labo.FM   -0.05457785 -0.72432726 -0.58428770  0.06199424  0.2625067
## Edu.Exp    0.42766720 -0.13940571 -0.07340270 -0.07020294  0.1659678
## Life.Exp   0.44372240  0.02530473  0.10991305 -0.05834819  0.1628935
## GNI        0.35048295 -0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor   -0.43697098 -0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth -0.41126010 -0.07708468  0.01968243  0.04986763 -0.4672068
## Parli.F    0.08438558 -0.65136866  0.72506309  0.01396293 -0.1523699
##                   PC6         PC7         PC8
## Edu2.FM    0.17713316  0.05773644  0.16459453
## Labo.FM   -0.03500707 -0.22729927 -0.07304568
## Edu.Exp   -0.38606919  0.77962966 -0.05415984
## Life.Exp  -0.42242796 -0.43406432  0.62737008
## GNI        0.11120305 -0.13711838 -0.16961173
## Mat.Mor    0.17370039  0.35380306  0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F    0.13749772  0.00568387 -0.02306476
summary(pca_human)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2)

pca.plot <- autoplot(pca_human, data = scale(human), colour = 'GNI')
pca.plot

In PCA, scaling is necessary because it removes the effect of the original variables' units and scales, which matters especially in our case where the variables are measured in different units. The standardized variables are unitless and have equal variance, which allows PCA to split the data into components that explain as much of the variance as possible. Hence, it is reasonable to interpret only the results for the standardized data.
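A small simulated example makes the point: when one variable has a much larger variance, it dominates unscaled PCA entirely.

```r
set.seed(42)
d <- data.frame(a = rnorm(100), b = rnorm(100, sd = 1000))
# proportion of variance explained by the first principal component
prop1 <- function(p) summary(p)$importance["Proportion of Variance", "PC1"]
prop1(prcomp(d))                 # close to 1: PC1 is essentially b alone
prop1(prcomp(d, scale. = TRUE))  # close to 0.5: both variables contribute
```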

The first principal component captures 54 percent of the total variability of the data, the second 16 percent, and the third 10 percent; each of the remaining components explains less than 10 percent, about 20 percent in total. That is, the first three principal components explain roughly 80 percent of the total variability of the human data.

From the biplot we can see that the variables Labo.FM and Parli.F are highly associated and form a cluster. Likewise, Ado.Birth and Mat.Mor are associated and form a second cluster. The third and largest cluster is formed by Edu.Exp, Life.Exp, Edu2.FM, and GNI. Within each cluster there is a strong positive correlation between the variables. The length of a variable's arrow in the biplot reflects how much variance that variable contributes to the displayed components: the longer the arrow, the better the variable is represented in this plane. The more strongly two variables are associated, the smaller the angle between their arrows. The clusters can thus be interpreted as three themes: women's rights, child mortality, and education and wealth. The first principal component explains the variation in child mortality and education, and the second principal component the variation in women's rights.

Tea dataset

Lastly we load the tea dataset from the FactoMineR package and explore the data briefly. The data come from a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions), what their perception of the product is (12 questions), and some personal details (4 questions). The data frame has 300 rows and 36 columns; rows represent individuals and columns the different questions. The first 18 questions are active ones, the 19th is a supplementary quantitative variable (age), and the remaining variables are supplementary categorical variables.

library(FactoMineR)
library(dplyr)
library(tidyr)
library(ggplot2) # used for the bar plots below

# reduced dataset
data(tea)
tea_time <- select(tea, c("Tea", "How", "how", "sugar", "where", "lunch"))
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
dim(tea_time)
## [1] 300   6
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")

To summarize the data: people mostly drink tea from tea bags, without lemon or milk, and usually outside lunch time. Drinkers are split almost evenly between taking tea with and without sugar. Earl Grey is the most popular variety, and people buy their tea mainly from chain stores.

Based on the MCA, we can conclude that unpackaged tea is bought from tea shops and tea bags from chain stores. There is a cluster of drinking tea outside lunch time without any additional ingredients (milk or lemon). Moreover, there seems to be a slight association between drinking Earl Grey and taking milk. Green tea, in turn, appears to be associated with unpackaged tea bought from tea shops.


Analysis of Longitudinal Data Part 1

Today is

## [1] "Thu Dec  3 12:01:37 2020"

This week we practise the analysis of longitudinal data with graphical displays and summary measures. We use the RATS data from a nutrition study conducted in three groups of rats (Crowder and Hand, 1990). The three groups were put on different diets, and each animal's body weight (grams) was recorded repeatedly (approximately weekly, except in week seven when two recordings were taken) over a 9-week period.

# Access the packages dplyr and tidyr
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)

# Read the data
RATSL <- read.csv("/Users/teemupekkarinen/IODS-project/data/RATSL",row.names = 1,stringsAsFactors = T)
str(RATSL)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Group: int  1 1 1 1 1 1 1 1 2 2 ...
##  $ Times: Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Rats : int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time : int  1 1 1 1 1 1 1 1 1 1 ...
dim(RATSL)
## [1] 176   5
summary(RATSL)
##        ID            Group          Times         Rats            Time      
##  Min.   : 1.00   Min.   :1.00   WD1    :16   Min.   :225.0   Min.   : 1.00  
##  1st Qu.: 4.75   1st Qu.:1.00   WD15   :16   1st Qu.:267.0   1st Qu.:15.00  
##  Median : 8.50   Median :1.50   WD22   :16   Median :344.5   Median :36.00  
##  Mean   : 8.50   Mean   :1.75   WD29   :16   Mean   :384.5   Mean   :33.55  
##  3rd Qu.:12.25   3rd Qu.:2.25   WD36   :16   3rd Qu.:511.2   3rd Qu.:50.00  
##  Max.   :16.00   Max.   :3.00   WD43   :16   Max.   :628.0   Max.   :64.00  
##                                 (Other):80
# Factor ID & Group
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)

# Take a glimpse at the RATSL data
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ Times <fct> WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1…
## $ Rats  <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …
# Table 
kable(RATSL[1:16,])
ID Group Times Rats Time
1 1 WD1 240 1
2 1 WD1 225 1
3 1 WD1 245 1
4 1 WD1 260 1
5 1 WD1 255 1
6 1 WD1 260 1
7 1 WD1 275 1
8 1 WD1 245 1
9 2 WD1 410 1
10 2 WD1 405 1
11 2 WD1 445 1
12 2 WD1 555 1
13 3 WD1 470 1
14 3 WD1 535 1
15 3 WD1 520 1
16 3 WD1 510 1
# Draw the plot
ggplot(RATSL, aes(x = Time, y = Rats, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATSL$Rats), max(RATSL$Rats)))

That is, there are 16 different rats, and the table above gives their initial weights on day 1. The first figure shows how the weight of the rats in each group evolves over the test period. Clearly there are differences in weight between the groups. Moreover, in group 2 there is a clear outlier, one very large rat. Group 1 contains the lightest rats, group 2 the second lightest (except for the outlier), and group 3 the heaviest. In all groups the weight of the rats increased during the test period; how much is the question we try to answer next.

Once we standardise the weights of the rats within each time period, we observe that the rats in group 1 stay below the overall average at every time point, while the rats in groups 2 and 3 stay above it.

# Standardise the variable Rats
RATSL <- RATSL %>%
  group_by(Time) %>%
  mutate(stdRats = (Rats - mean(Rats))/sd(Rats) ) %>%
  ungroup()

# Glimpse the data
glimpse(RATSL)
## Rows: 176
## Columns: 6
## $ ID      <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, …
## $ Group   <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1…
## $ Times   <fct> WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, W…
## $ Rats    <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 4…
## $ Time    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8…
## $ stdRats <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8819001, -…
# Plot again with the standardised Rats
ggplot(RATSL, aes(x = Time, y = stdRats, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "standardized Rats")

Next we look at how the mean weight of the rats in each group evolves over time. We observe that the mean weight increases in every group.

# Number of time periods, baseline (time 1) included
n <- RATSL$Time %>% unique() %>% length()

# Summary data with mean and standard error of Rats by Group and Time 
RATSS <- RATSL %>%
  group_by(Group, Time) %>%
  summarise( mean = mean(Rats), se = sd(Rats)/sqrt(n) ) %>%
  ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
# Glimpse the data
glimpse(RATSS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36,…
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, …
## $ se    <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.552939, 3.3…
# Plot the mean profiles
ggplot(RATSS, aes(x = Time, y = mean, linetype=Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=4) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.8,0.5)) +
  scale_y_continuous(name = "mean(Rats) +/- se(Rats)")

Next, we create a summary data by Group and ID with mean as the summary variable (ignoring baseline time period 1). The box plots of the full data and without the outlier rat are given below.

# Create a summary data by Group and ID with mean as the summary variable (ignoring baseline time period 1).
RATSL8S <- RATSL %>%
  filter(Time > 1) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(Rats) ) %>%
  ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
# Glimpse the data
glimpse(RATSL8S)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, …
# Draw a boxplot of the mean versus treatment
ggplot(RATSL8S, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Rats), time periods 2-64")

# Create a new data by filtering the outlier and adjust the ggplot code the draw the plot again with the new data
RATSL8S1 <- RATSL8S %>%
  filter(mean < 550)

# Glimpse the data
glimpse(RATSL8S1)
## Rows: 15
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16
## $ mean  <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, …
# Draw a boxplot of the mean versus treatment
ggplot(RATSL8S1, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Rats), time periods 2-64")

Finally, we perform a two-sample t-test, which confirms the group differences seen in the boxplots above. We then compute the analysis of variance table for a fitted model that includes the baseline weight. We see that the baseline weight is strongly related to the weights recorded after the treatment (Group) began, but there is still no clear evidence of a treatment (Group) difference even after conditioning on the baseline value.

# Perform a two-sample t-test

# Let us put groups 2 and 3 together in order to run the t-test
RATSL8S1$Group[which(RATSL8S1$Group==2)] <- 3
t.test(mean~Group, data = RATSL8S1, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -14.554, df = 13, p-value = 2.002e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -264.4733 -196.1053
## sample estimates:
## mean in group 1 mean in group 3 
##        265.0250        495.3143
# Add the baseline from the original data as a new variable to the summary data
RATSL8S2 <- RATSL8S %>%
  mutate(baseline = RATSL$Rats[which(RATSL$Times=="WD1")])

# Fit the linear model with the mean as the response 
fit <- lm(mean ~ baseline + Group, data = RATSL8S2)
summary(fit)
## 
## Call:
## lm(formula = mean ~ baseline + Group, data = RATSL8S2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.905  -4.194   2.190   7.577  14.800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.16375   21.87657   1.516   0.1554    
## baseline     0.92513    0.08572  10.793 1.56e-07 ***
## Group2      34.85753   18.82308   1.852   0.0888 .  
## Group3      23.67526   23.25324   1.018   0.3287    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.68 on 12 degrees of freedom
## Multiple R-squared:  0.9936, Adjusted R-squared:  0.992 
## F-statistic: 622.1 on 3 and 12 DF,  p-value: 1.989e-13
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value   Pr(>F)    
## baseline   1 253625  253625 1859.8201 1.57e-14 ***
## Group      2    879     439    3.2219  0.07586 .  
## Residuals 12   1636     136                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
kable(anova(fit))
Df Sum Sq Mean Sq F value Pr(>F)
baseline 1 253625.4006 253625.4006 1859.820128 0.0000000
Group 2 878.7458 439.3729 3.221896 0.0758554
Residuals 12 1636.4512 136.3709 NA NA

Analysis of Longitudinal Data Part 2

In this part we go a little deeper into the analysis than in Part 1. As the course book says:

“The summary measure approach to the analysis of longitudinal data described in the previous chapter sometimes provides a useful first step in making inferences about the data, but it is really only ever a first step, and a more complete and a more appropriate analysis will involve fitting a suitable model to the data and estimating parameters that link the explanatory variables of interest to the repeated measures of the response variable”

To that end, we use Linear Mixed Effects Models for repeated measures data. To begin to see how linear mixed effects models are applied in practice, we shall use some data from Davis (2002): “Here 40 male subjects were randomly assigned to one of two treatment groups and each subject was rated on the brief psychiatric rating scale (BPRS) measured before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia.”

We now describe the data in a similar way as for the RATS data. The summaries, the structure, and a plot for the treatment and control groups are given below.

# Access the packages dplyr and tidyr
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)

# Read the data
BPRSL <- read.csv("/Users/teemupekkarinen/IODS-project/data/BPRSL",row.names = 1,stringsAsFactors = T)
str(BPRSL)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
dim(BPRSL)
## [1] 360   5
summary(BPRSL)
##    treatment      subject          weeks          bprs            week  
##  Min.   :1.0   Min.   : 1.00   week0  : 40   Min.   :18.00   Min.   :0  
##  1st Qu.:1.0   1st Qu.: 5.75   week1  : 40   1st Qu.:27.00   1st Qu.:2  
##  Median :1.5   Median :10.50   week2  : 40   Median :35.00   Median :4  
##  Mean   :1.5   Mean   :10.50   week3  : 40   Mean   :37.66   Mean   :4  
##  3rd Qu.:2.0   3rd Qu.:15.25   week4  : 40   3rd Qu.:43.00   3rd Qu.:6  
##  Max.   :2.0   Max.   :20.00   week5  : 40   Max.   :95.00   Max.   :8  
##                                (Other):120
# Factor subject & treatment
BPRSL$subject <- factor(BPRSL$subject)
BPRSL$treatment <- factor(BPRSL$treatment)

# Take a glimpse at the BPRSL data
glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ weeks     <fct> week0, week0, week0, week0, week0, week0, week0, week0, wee…
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66,…
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# Table 
kable(BPRSL[1:20,])
treatment subject weeks bprs week
1 1 week0 42 0
1 2 week0 58 0
1 3 week0 54 0
1 4 week0 55 0
1 5 week0 72 0
1 6 week0 48 0
1 7 week0 71 0
1 8 week0 30 0
1 9 week0 41 0
1 10 week0 57 0
1 11 week0 30 0
1 12 week0 55 0
1 13 week0 36 0
1 14 week0 38 0
1 15 week0 66 0
1 16 week0 41 0
1 17 week0 45 0
1 18 week0 39 0
1 19 week0 24 0
1 20 week0 38 0
# Draw the plot
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))

Let us next jump into the regression model.

# create a regression model BPRSL_reg
BPRSL_reg <- lm(bprs ~ week + treatment, data = BPRSL)

# print out a summary of the model
summary(BPRSL_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

From here we observe a small, statistically insignificant positive effect of treatment, and a significant negative effect of time.
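As a quick check of the printed coefficients, we can reproduce the model's fitted mean bprs at, say, week 4 for both treatment groups (a sketch using the rounded estimates from the summary, not a replacement for predict()):

```r
# Fitted mean bprs at week 4, computed by hand from the printed
# coefficients of BPRSL_reg above
b <- c(intercept = 46.4539, week = -2.2704, treatment2 = 0.5722)
bprs_t1 <- b[["intercept"]] + b[["week"]] * 4                     # treatment 1
bprs_t2 <- b[["intercept"]] + b[["week"]] * 4 + b[["treatment2"]] # treatment 2
round(c(treatment1 = bprs_t1, treatment2 = bprs_t2), 2)
# treatment1 treatment2
#      37.37      37.94
```

The small gap between the two groups at every week reflects the insignificant treatment coefficient.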

This model assumes independence of the repeated measures of bprs, which is unrealistic. Next we consider the random intercept model: fitting a random intercept allows the linear regression fit for each subject to differ in intercept from the other subjects.

# dplyr, tidyr and BPRSL are available

# access library lme4
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
# Create a random intercept model
BPRSL_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)

# Print the summary of the model
summary(BPRSL_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

The week fixed effect remains negative and significant, and the treatment effect remains positive but insignificant. The correlations of the fixed effects show that both week and treatment2 are negatively correlated with the intercept.

Next we create a random intercept and random slope model. “Fitting a random intercept and random slope model allows the linear regression fits for each individual to differ in intercept but also in slope. This way it is possible to account for the individual differences in the rats’ growth profiles, but also the effect of time.”

# create a random intercept and random slope model
BPRSL_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)

# print a summary of the model
summary(BPRSL_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000
# perform an ANOVA test on the two models
anova(BPRSL_ref1, BPRSL_ref)
## Data: BPRSL
## Models:
## BPRSL_ref: bprs ~ week + treatment + (1 | subject)
## BPRSL_ref1: bprs ~ week + treatment + (week | subject)
##            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRSL_ref     5 2748.7 2768.1 -1369.4   2738.7                       
## BPRSL_ref1    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The fixed-effect estimates are not very different from the earlier results; all signs and significances remain the same. However, the likelihood-ratio test (chi-squared 7.27 with 2 degrees of freedom, p = 0.026) shows that the random intercept and slope model fits the data significantly better than the random intercept model.
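As a sanity check, the chi-squared statistic and p-value printed by anova() can be reproduced by hand from the two deviances (a sketch using the rounded numbers quoted in the output above):

```r
# Likelihood-ratio test by hand: the test statistic is the drop in
# deviance, with df = difference in the number of parameters (7 - 5 = 2).
dev_ref  <- 2738.7   # random intercept model
dev_ref1 <- 2731.4   # random intercept + slope model
chisq <- dev_ref - dev_ref1
p <- pchisq(chisq, df = 2, lower.tail = FALSE)
round(c(chisq = chisq, p = p), 4)
```

The small discrepancy from the printed statistic (7.2721) comes from using the rounded deviances.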

Finally, we can fit a random intercept and slope model that allows for a treatment \(\times\) week interaction.

# create a random intercept and random slope model
BPRSL_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)

# print a summary of the model
summary(BPRSL_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2521  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
# perform an ANOVA test on the two models
anova(BPRSL_ref2, BPRSL_ref1)
## Data: BPRSL
## Models:
## BPRSL_ref1: bprs ~ week + treatment + (week | subject)
## BPRSL_ref2: bprs ~ week * treatment + (week | subject)
##            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRSL_ref1    7 2745.4 2772.6 -1365.7   2731.4                       
## BPRSL_ref2    8 2744.3 2775.4 -1364.1   2728.3 3.1712  1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# draw the plot of BPRSL
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs))) +
  scale_x_continuous(name = "Week", breaks = seq(0, 8, 1))

# Create a vector of the fitted values
Fitted <- fitted(BPRSL_ref2)

# Create a new column fitted to BPRSL
BPRSL <- BPRSL %>%
  mutate(Fitted)

# draw the plot of BPRSL
ggplot(BPRSL, aes(x = week, y = Fitted, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs))) +
  scale_x_continuous(name = "Week", breaks = seq(0, 8, 1))

Adding the interaction term changes the sign of the treatment effect to negative. The coefficient of the interaction term is positive but only marginally significant (the likelihood-ratio test gives p = 0.075). Similar changes appear in the correlations of the fixed effects.

Plotting the fitted values from the interaction model next to the observed profiles, we observe that the fit is quite good.